Hierarchical deep learning time steppers

Nicolaj Hans Nielsen

DTU, August 2021

This notebook is primarily based on work from

Liu, Yuying and Kutz, J Nathan and Brunton, Steven L, Hierarchical Deep Learning of Multiscale Differential Equation Time-Steppers, 2020, arXiv preprint arXiv:2008.09768

Why is this relevant?

Dynamics at different scales

Almost every dynamical system exhibits dynamics at different time scales. The eye of a storm is a great example: rapid dynamics occurs close to the eye and slow dynamics at the periphery.

Picture from NOAA

Consider a city at the periphery of the storm. Information about the development of the storm is crucial for the locals to take appropriate action, hence they want a model that can quickly predict the future.
To model such physical systems, we can derive systems of equations from first principles, and with these we can to some extent model the complex spatio-temporal development of the storm. Though the very slow dynamics at the periphery could be modeled with a time-stepping scheme with a large step size $\Delta t_{slow}$, we cannot benefit from this because the bottom-up approach simultaneously steps forward the states of the entire spatial range of the storm. For the model to make precise predictions, it has to take many small steps in the eye of the storm where very fast dynamics takes place. The predictions of the model are bounded by $\Delta t_{fast} \ll \Delta t_{slow}$ as the fast dynamics makes the system stiff. If we are only interested in the prediction at the periphery, we have to make a vast number of unnecessary intermediate steps of size $\Delta t_{fast}$ just to advance the solution by $\Delta t_{slow}$. This motivates the use of multiscale models that avoid all these unnecessary intermediate function evaluations for the slow dynamics. Further, models derived from first principles rely on a set of basic assumptions about the physics that might not hold for real-world data, where exogenous variables and microphysics can affect the system variables. This motivates a data-driven approach, where we find a model based on the available data.

Introducing the flow map

Consider a system with states $\mathbf{x}(t)\in \mathbb{R}^d$. We want to determine a set of maps $\{F_{\Delta t_{fast}}(\mathbf{x}), \: F_{\Delta t_{slow}}\left(\mathbf{x}\right)\}$ s.t.

$$ \begin{align*} \mathbf{x}(t+\Delta t_{fast}) &= \mathbf{x}(t) + F_{\Delta t_{fast}}\left(\mathbf{x}(t)\right) \\ \mathbf{x}(t+\Delta t_{slow}) &= \mathbf{x}(t) + F_{\Delta t_{slow}}\left(\mathbf{x}(t)\right) \end{align*} $$

This is the basic flow-map setup for a multiscale system. We now address:

Equation-free approach

How we determine these maps

Multiscale modeling is a field that already offers different techniques and averaging theorems to prescribe the dynamics of systems at different scales. However, what if we don't know how to describe the system, or can only measure an insufficient number of variables to use formulas derived from first principles? In this notebook, we approximate the flow maps using neural networks. To guarantee theoretically that this is possible, we invoke the universal approximation theorem.

Let $F_{\Delta t}: K \to \mathbb{R}^d$ be a flow map with a fixed step size $\Delta t$, where $F_{\Delta t}$ is a continuous function defined on a compact set $K \subset \mathbb{R}^d$. For every $\epsilon>0$, there exists an approximation $\hat{F}_{\Delta t}$ given by a neural network with at least one hidden layer and a non-polynomial activation function $\phi (\cdot)$ s.t.

$$ \left\vert F_{\Delta t}(\mathbf{x}) - \hat{F}_{\Delta t}(\mathbf{x}) \right\vert < \epsilon $$

i.e. $\hat{F}_{\Delta t}$ uniformly approximates $F_{\Delta t}$ for all $\mathbf{x} \in K$.

Note that though $F_{\Delta t}$ is a discrete flow map in time, it is a continuous function of the states $\mathbf{x}$ for a fixed step size $\Delta t$.

This means that theoretically, we can indeed find this approximation to the flow map.

Scale bridging

How we couple the flow maps

Consider one flow map $F_{\Delta t}$ only. To propagate the solution of the system, we will define the flow map s.t.

$$ \begin{align*} \mathbf{x}_{t+\Delta t}=\mathbf{x}_{t}+\mathbf{F}_{\Delta t} \left(\mathbf{x}_t \right) \end{align*} $$

and compose the flow map to take multiple steps in time:

$$ \begin{align*} \hat{\mathbf{x}}_{t+ 2 \Delta t}&= \mathbf{x}_{t} + \hat{\mathbf{F}}_{\Delta t} \left(\hat{\mathbf{F}}_{\Delta t} \left(\mathbf{x}_t \right)\right) \\ &=\mathbf{x}_{t}+\left(\hat{\mathbf{F}}_{\Delta t} \circ \hat{\mathbf{F}}_{\Delta t}\right)\left(\mathbf{x}_t\right) \end{align*} $$

likewise for $k>2$

$$ \begin{align*} \mathbf{x}_{t+ k \Delta t}&= \mathbf{x}_{t} + \mathbf{F}_{\Delta t}^{(k)}\left(\mathbf{x}_t\right) \\ &= \mathbf{x}_{t} + \underbrace{\left(\boldsymbol{F}_{\Delta t} \circ \boldsymbol{F}_{\Delta t} \circ \cdots \circ \boldsymbol{F}_{\Delta t}\right)}_{k \text { times }}\left(\mathbf{x}_t\right) \end{align*} $$
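As a minimal numerical sketch of this composition (a toy illustration, not the trained networks of the paper), consider the residual flow map for $\dot{x} = -x$, whose exact residual update is $F_{\Delta t}(x) = (e^{-\Delta t}-1)\,x$; composing it $k$ times advances the state by $k \Delta t$:

```python
import numpy as np

def step(flow_map, x):
    """One residual update: x_{t+dt} = x_t + F_dt(x_t)."""
    return x + flow_map(x)

def compose(flow_map, x, k):
    """Apply the residual flow map k times, advancing by k * dt."""
    for _ in range(k):
        x = step(flow_map, x)
    return x

# Toy flow map for dx/dt = -x with dt = 0.1: the exact residual is
# F_dt(x) = (exp(-dt) - 1) * x, so each step multiplies x by exp(-dt).
dt = 0.1
F = lambda x: (np.exp(-dt) - 1.0) * x

x0 = np.array([1.0])
x5 = compose(F, x0, 5)   # five steps of size dt, matches exp(-0.5) * x0
```

Because the toy residual is exact, the five-fold composition reproduces the analytic solution $e^{-5\Delta t}x_0$; a learned $\hat{F}_{\Delta t}$ would instead accumulate its approximation error at each composition, which is exactly the issue addressed next.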

However, as we will approximate the flow maps using a neural network, we introduce $\hat{\mathbf{F}}_{\Delta t}$ and the approximation error $\delta$.

$$ \begin{align} \mathbf{x}_{t+ \Delta t}&= \mathbf{x}_{t} + \hat{\mathbf{F}}_{\Delta t} + \delta_{\Delta t}\\ \end{align} $$

This approximation error will propagate for multiple compositions s.t.

$$ \begin{align} \mathbf{x}_{t+ k \Delta t}&= \mathbf{x}_{t} + \hat{\mathbf{F}}_{\Delta t}^{(k)} + \delta_{k \Delta t} \\ \end{align} $$

The accumulated error $\delta_{k \Delta t}$ depends on the system and the step size $\Delta t$. For long simulations with a small $\Delta t$, the approximation error can accumulate quite fast as $k\rightarrow \infty$. Therefore, we define a hierarchy of flow maps.
Consider the set $\{\hat{F}_{\Delta t_{fast}}(\mathbf{x}), \: \hat{F}_{\Delta t_{slow}}(\mathbf{x})\}$ and let the flow map with the larger $\Delta t$ correct and reset the trajectory computed with $\hat{F}_{\Delta t_{fast}}$.
Since $\Delta t_{fast} \ll \Delta t_{slow}$, we require $\Delta t_{slow}$ to be divisible by $\Delta t_{fast}$, and a correction happens whenever the elapsed time $k \Delta t_{fast}$, $k\in \mathbb{Z}$, is a multiple of $\Delta t_{slow}$.
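A minimal sketch of this correct-and-reset rule (the function names and toy maps are assumptions for illustration, not the paper's implementation): the fast map advances every step, and whenever the elapsed time hits a multiple of $\Delta t_{slow}$, the slow map's prediction overwrites the fast trajectory:

```python
import numpy as np

def hits_two_level(x0, f_fast, f_slow, dt_fast, dt_slow, n_fast_steps):
    """Two-level hierarchy: step with the fast residual map each
    iteration; whenever the elapsed time is a multiple of dt_slow,
    recompute the state with the slow map and reset the trajectory."""
    ratio = round(dt_slow / dt_fast)  # dt_slow must be divisible by dt_fast
    x_fast, x_slow = x0, x0
    traj = [x0]
    for i in range(1, n_fast_steps + 1):
        x_fast = x_fast + f_fast(x_fast)        # fine step
        if i % ratio == 0:                      # times align
            x_slow = x_slow + f_slow(x_slow)    # one coarse step
            x_fast = x_slow                     # correct and reset
        traj.append(x_fast)
    return np.array(traj)

# Toy maps for dx/dt = -x: the slow map is exact, while the fast map is
# slightly biased, standing in for an imperfect learned map.
dt_f, dt_s = 0.1, 0.4
f_fast = lambda x: 1.01 * (np.exp(-dt_f) - 1.0) * x   # biased fine map
f_slow = lambda x: (np.exp(-dt_s) - 1.0) * x          # exact coarse map

traj = hits_two_level(np.array([1.0]), f_fast, f_slow, dt_f, dt_s, 8)
# at t = 0.8 the state has been reset by the exact slow map twice
```

At every fourth fine step the accumulated bias of `f_fast` is discarded, so the error of the coupled trajectory stays bounded by the coarse map's error at the slow gridpoints.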

Liu, Yuying et al. created an illustration of this for a system with fast, medium and slow flow maps with step sizes $\Delta t_1, \Delta t_2, \Delta t_3$ respectively, with the notation $\hat{\mathbf{F}}_{1}\left(\cdot ; \Delta t_{1}\right)$, $\hat{\mathbf{F}}_{2}\left(\cdot ; \Delta t_{2}\right)$ and $\hat{\mathbf{F}}_{3}\left(\cdot ; \Delta t_{3}\right)$.

Each time the maps overlap, the computed $\hat{\mathbf{x}}$ is corrected by the map of larger $\Delta t$

Explicitly, we believe the larger flow maps are more correct, i.e. $k_1 \delta_{fast} > k_2 \delta_{medium} > \delta_{slow}$ at instances where $t+k_1 \Delta t_{fast} =t+ k_2 \Delta t_{medium} =t+ \Delta t_{slow}$ for $k_1,k_2 \in \mathbb{Z}$ and $k_1 > k_2$.

Advantages and disadvantages of this method

We will more specifically focus on the advantages when compared with

Sequential modelling

As this method consists of a hierarchy of flow maps, we can train each individual flow map in parallel and on limited data. Sequential modeling techniques such as recurrent neural networks and long short-term memory (LSTM) networks keep an internal hidden state to pass on information about the series. These often run into training problems: the network needs to be trained on very long data series, so it quickly encounters vanishing gradients during backpropagation.

Classical numerical integration

To understand the advantages and disadvantages, we quickly recap how classical numerical integration techniques work. For a system with $\mathbf{x}(t)\in \mathbb{R}^d$ let

$$ \begin{align*} \frac{\mathrm{d}}{\mathrm{d} t} \mathbf{x}(t)= \mathbf{f} \left(\mathbf{x}(t),t\right) \end{align*} $$

prescribe the dynamics. Then we could define a flow map:

$$ \begin{align*} \mathbf{x}(t+\Delta t_{fast}) &= F_{\Delta t_{fast}}(\mathbf{x}(t)) \end{align*} $$

A classical numerical integration scheme approximates this through the integral form using $\mathbf{f}$:

$$ \begin{align*} \mathbf{x}(t+\Delta t_{fast}) &= F_{\Delta t_{fast}}(\mathbf{x}(t)) \\ &\triangleq \mathbf{x}(t) + \int_t^{t+\Delta t_{fast}}\mathbf{f} \left(\mathbf{x}(\tau),\tau \right) \mathrm{d}\tau \end{align*} $$

The integral can be approximated through a sum over intermediate steps $k = 0, \dots, n$. The simplest such scheme is the forward Euler method

$$ \begin{align*} \mathbf{x}_{k+1} \approx \mathbf{x}_k + \Delta t_k \mathbf{f}(\mathbf{x}_k,t_k) \end{align*} $$

which we recognize as the first-order truncation of a Taylor series expansion at $\mathbf{x}_k$. Let $\mathbf{x}_{k+1} = \mathbf{x}(t+\Delta t_k)$ and $\mathbf{x}_k=\mathbf{x}(t)$; then

$$ \begin{align*} \mathbf{x}(t+\Delta t_k) &= \mathbf{x}(t) + \Delta t_k\, \mathbf{x}^{\prime}(t) + \frac{\Delta t_k^2}{2} \mathbf{x}^{\prime \prime}(t)+ \frac{\Delta t_k^3}{3!}\mathbf{x}^{\prime \prime \prime}(t) + \dots \\ &\approx \mathbf{x}(t) + \Delta t_k\, \mathbf{x}^{\prime}(t) \\ & \triangleq \mathbf{x}(t) + \Delta t_k\, \mathbf{f}(\mathbf{x}(t),t) \end{align*} $$

i.e.

$$ \begin{align*} \frac{\mathbf{x}(t+\Delta t_k)-\mathbf{x}(t)}{\Delta t_k} = \mathbf{f}(\mathbf{x}(t),t) + \mathcal{O} (\Delta t_k) \end{align*} $$

Notice the $\mathcal{O} (\Delta t_k)$. No matter the system, we are always bounded by $\mathcal{O} (\Delta t_k)$: we cannot take larger steps without a loss in precision and accuracy. This means we need to take many intermediate steps, making simulations computationally expensive. Of course, more sophisticated schemes such as Runge-Kutta 4 lower the error: RK4 has a local error of $\mathcal{O}(\Delta t_k^5)$ and a global error of $\mathcal{O}(\Delta t_k^4)$. However, such methods are still bounded by this very local $\Delta t_k$.
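The $\mathcal{O}(\Delta t_k)$ bound is easy to verify numerically. A small sketch for $\dot{x}=-x$ (exact solution $e^{-t}$), where halving the forward Euler step roughly halves the global error:

```python
import numpy as np

def euler(f, x0, t_end, dt):
    """Forward Euler: x_{k+1} = x_k + dt * f(x_k, t_k)."""
    x, t = x0, 0.0
    while t < t_end - 1e-12:
        x = x + dt * f(x, t)
        t += dt
    return x

f = lambda x, t: -x          # dx/dt = -x, exact solution exp(-t)
exact = np.exp(-1.0)

err_coarse = abs(euler(f, 1.0, 1.0, 0.1) - exact)
err_fine   = abs(euler(f, 1.0, 1.0, 0.05) - exact)
# global error is O(dt): halving dt roughly halves the error
```

The ratio `err_coarse / err_fine` comes out close to 2, as first-order convergence predicts.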

If we can instead approximate a flow map $\hat{F}_{\Delta t_{flow}}$ of step size $\Delta t_{flow}\gg\Delta t_k$, we avoid the intermediate steps. Further, if we use the presented hierarchy of flow maps with the correcting scale bridging, we can run many of the computations in parallel, which potentially makes the computations fast.

Hybrid methods are also possible, s.t. the flow map with the lowest $\Delta t$ is actually a numerical integration scheme while the larger maps are approximated by neural networks.

Training

Each map is trained separately on data, and in the following, each dataset should be considered as created specifically for the flow map of interest. We essentially train a flow map plus the identity, as presented in the section above, $x_{t+\Delta t} = x_t +\hat{F}_{\Delta t} + \delta_{\Delta t}$, where $\delta_{\Delta t}$ is the approximation error. We essentially train on the difference

$$ \begin{align*} x_{t+\Delta t} - x_t= \hat{F}_{\Delta t} + \delta_{\Delta t} \end{align*} $$

We want the model to have good predictive abilities, and it is essential that the map has a low approximation error even when we take multiple steps. When we move $k$ steps, we create a composition of the same flow map, $\hat{\mathbf{F}}^{(k)} = \underbrace{\hat{\boldsymbol{F}}_{j} \circ \hat{\boldsymbol{F}}_{j} \circ \cdots \circ \hat{\boldsymbol{F}}_{j}}_{k \text { times }}$, s.t.

$$ \begin{align*} \hat{\mathbf{x}}_{t+k\Delta t} = \mathbf{x}_{t} + \hat{\mathbf{F}}^{(k)} \end{align*} $$

Consider some data $\mathcal{D}_{\Delta t}$ sampled with exactly this $\Delta t$

$$ \begin{align*} \mathcal{D}_{\Delta t} = \{x_t, x_{t+\Delta t}, \dots, x_{t+k \Delta t}\} \end{align*} $$

As input the flow map only gets $x_t$; it then creates a series of predictions up to the $k$'th step, $\hat{\mathbf{x}}=\{\hat{x}_t, \hat{x}_{t+\Delta t}, \dots, \hat{x}_{t+k \Delta t}\}$. We use the MSE as our loss function

$$ \begin{align*} \ell=\frac{1}{k} \sum_{i=0}^{k-1} \left[\mathbf{x}_{t+i\Delta t}-\hat{\mathbf{x}}_{t+i\Delta t}\right]^2 \end{align*} $$

With a fixed network architecture, let the parameters of the network be denoted by $\Theta$

$$ \begin{align*} \ell(\Theta)=\frac{1}{k} \sum_{i=0}^{k-1} \left[\mathbf{x}_{t+i\Delta t}- \left(\mathbf{x}_t + \hat{\mathbf{F}}^{(i)}(\Theta)\right) \right]^2 \end{align*} $$

Now we sample $n$ trajectories, each with a different $x_t$:

$$ \begin{align*} \mathcal{D}_{\Delta t}^{(j)} = \{x_t^{(j)}, x_{t+\Delta t}^{(j)}, \dots, x_{t+k \Delta t}^{(j)}\} \end{align*} $$

we can now define a loss function for all $n$ trajectories

$$ \begin{align*} \ell(\Theta)=\frac{1}{nk} \sum_{j=0}^{n-1} \sum_{i=0}^{k-1} \left[\mathbf{x}_{t+i\Delta t}^{(j)}- \left(\mathbf{x}_t^{(j)} + \left\{\hat{\mathbf{F}}^{(i)}(\Theta)\right\}^{(j)} \right) \right]^2 \end{align*} $$

We follow the same procedure for each flow map $\hat{F}_{\Delta t_{i}}$ with its step size $\Delta t_{i}$. To make sure we train on the same data, we use the same initial conditions for each trajectory.
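A sketch of this training procedure in PyTorch (the placeholder network, tensor shapes and random stand-in data are assumptions for illustration, not the original implementation): the network predicts the residual, we roll it out $k$ times, and the loss averages the squared errors over the rollout:

```python
import torch
import torch.nn as nn

# Placeholder residual network for a 2-dimensional state (illustrative).
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

def k_step_loss(net, batch, k):
    """MSE of a k-step rollout. `batch` has shape (n, k + 1, d): n
    trajectories of k + 1 states sampled at this flow map's step size."""
    x = batch[:, 0, :]                       # initial conditions x_t
    loss = 0.0
    for i in range(1, k + 1):
        x = x + net(x)                       # compose the residual map
        loss = loss + ((batch[:, i, :] - x) ** 2).mean()
    return loss / k

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
batch = torch.randn(320, 6, 2)               # stand-in data, k = 5
for epoch in range(3):                       # 5000 epochs in the text
    opt.zero_grad()
    loss = k_step_loss(net, batch, 5)
    loss.backward()
    opt.step()
```

Because the rollout reuses the same network $k$ times, the gradient flows through all compositions, directly penalizing error accumulation over multiple steps.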

Optimal bridge coupling

In the examples below, we will first train 10 flow maps following the above procedure. Now we want to find the optimal coupling of these. Consider a set of 10 flow maps $\{\hat{F}_{1},\hat{F}_{2},\dots,\hat{F}_{10}\}$ that constitute a model when coupled. The set could contain a range of nested models, e.g. $\{\hat{F}_{2},\dots,\hat{F}_{10}\}$ and $\{\hat{F}_{1},\hat{F}_{2},\dots,\hat{F}_{9}\}$, that perform as well as or better. To select the most appropriate model, we define the following model selection procedure. First, find the optimal granularity of the upper-bound flow map, $\{\hat{F}_{1},\hat{F}_{2},\dots,\hat{F}_{u^*}\}$, s.t.

$$ \begin{align} u^* &= \arg \min_u \frac{1}{N} \sum_{i=0}^{N-1} \ell_i(u), &&u^*\leq 10 \end{align} $$

We do this on the validation data to ensure the selection is not too optimistic. For $\ell_i(u)$, we use the MSE evaluated on the validation dataset.

With $u^*$ fixed, we find the optimal granularity of the lower-bound flow map $\{\hat{F}_{l^*},\hat{F}_{l^*+1},\dots,\hat{F}_{u^*}\}$ s.t.

$$ \begin{align} l^* &= \arg \min_l \frac{1}{N} \sum_{i=0}^{N-1} \ell_i(l), &&l^*\geq 1 \end{align} $$

We now have the optimal set of flow maps, $\{\hat{F}_{l^*},\hat{F}_{l^*+1},...,\hat{F}_{u^*}\}$, for our model. This model we will deploy on the test data to assess the model performance.
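The two-stage selection above can be sketched as follows (the scoring function `val_mse` is an assumption: in the notebook it would couple a sub-hierarchy and evaluate its MSE on the validation set; here a toy score stands in):

```python
def select_maps(maps, val_mse):
    """Two-stage selection over flow maps ordered from smallest to
    largest step. `val_mse(sub)` scores a sub-hierarchy (assumed given)."""
    n = len(maps)
    # stage 1: choose the upper index u* (how far up the hierarchy to go)
    u_star = min(range(1, n + 1), key=lambda u: val_mse(maps[:u]))
    # stage 2: with u* fixed, choose the lower index l*
    l_star = min(range(u_star), key=lambda l: val_mse(maps[l:u_star]))
    return l_star, u_star

# Toy score whose optimum is the sub-hierarchy maps[2:7] (labels 2..6)
maps = list(range(10))
val_mse = lambda sub: abs(sub[0] - 2) + abs(sub[-1] - 6)
l_star, u_star = select_maps(maps, val_mse)   # -> (2, 7)
```

Note this is a greedy two-pass search, not an exhaustive sweep over all $(l, u)$ pairs; it mirrors the procedure in the text, where $u^*$ is fixed before $l^*$ is chosen.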

Application

We will test the machinery on 3 different systems

We go through data generation and training for each of them. All these systems are simple, deterministic examples that let us understand how the machinery works on systems we know well.

Then we will test the model on wind speeds from DMI measured at Kegnæs fyr. The training data will be measurements from 2018 to the beginning of 2021; 2021 is reserved for validation and test data.

Each of the case studies will be described in detail below.

For all case studies, we will use the same architecture and setup:

Architecture and setup

We use a neural network with layers $2-128-128-128-2$ and ReLU activation functions, implemented in the PyTorch framework.
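In PyTorch, the stated architecture could look like the following sketch (layer sizes are from the text; the variable names are illustrative):

```python
import torch
import torch.nn as nn

# Residual flow-map network with layers 2-128-128-128-2 and ReLU
# activations, as described above (a sketch in plain PyTorch).
flow_map_net = nn.Sequential(
    nn.Linear(2, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),
)

x = torch.randn(32, 2)          # batch of 32 two-dimensional states
increment = flow_map_net(x)     # predicts the residual F_dt(x_t)
x_next = x + increment          # residual update x_t + F_dt(x_t)
```

One such network is trained per granularity, so the hierarchy consists of several copies of this architecture, each with its own parameters.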

For the simulated case studies, we sample datapoints at equidistant time points with $\Delta t=0.01$. The ultimate lower bound is therefore $\hat{F}_{\Delta t}$. We then generate flow maps with granularities $2^{i} \Delta t$ for $i\in \{0,1,2,\dots,10\}$.
For ease of notation, we let $\{\hat{F}_{\Delta t},\hat{F}_{2 \Delta t},\dots,\hat{F}_{1024\Delta t}\}=\{\hat{F}_{1},\hat{F}_{2},\dots,\hat{F}_{10}\}$.

We will set $k=5$ hence we train each network to map 5 times with their respective step size.

The flow maps considered for wind speeds are described in the wind speed section below.

Training

We train with mini-batches of size 320 using the Adam optimizer with a learning rate of 1e-3 for 5000 epochs. First, we train each flow map separately. Then, we turn to the HiTS setup and use cross-validation to find the upper and lower maps needed to get the best score in terms of MSE.

The choice of parameters for the above will become more evident when we consider a specific system.

Case studies

The Hyperbolic system


We first look at a hyperbolic system

$$ \begin{split} \dot{x} &= \mu x \\ \dot{y} &= \lambda(y-x^2) \end{split} $$

with $\mu = -0.05$ and $\lambda = -1.0$.

We compute 1600 trajectories for training, 320 for validation, and 320 for the test dataset. Each initial condition will be sampled from a uniform distribution s.t. $x_0, y_0 \sim \mathcal{U}_{[-1,1]}$.
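Data generation for this system could be sketched with `scipy.integrate.solve_ivp` (an assumption about tooling, not necessarily the original implementation); the sampling matches the $\Delta t=0.01$, 5120-step setup described below:

```python
import numpy as np
from scipy.integrate import solve_ivp

MU, LAM = -0.05, -1.0

def hyperbolic(t, z):
    """Right-hand side of the hyperbolic system."""
    x, y = z
    return [MU * x, LAM * (y - x ** 2)]

rng = np.random.default_rng(0)
dt, n_steps = 0.01, 5120
t_eval = np.arange(n_steps + 1) * dt

# one training trajectory with x0, y0 ~ U[-1, 1]
z0 = rng.uniform(-1.0, 1.0, size=2)
sol = solve_ivp(hyperbolic, (0.0, n_steps * dt), z0,
                t_eval=t_eval, rtol=1e-10, atol=1e-10)
trajectory = sol.y.T            # shape (5121, 2), rows are states in time
```

Repeating this for 1600/320/320 random initial conditions gives the train, validation and test ensembles; the flow maps are then trained on subsampled points of these trajectories.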

Consider one trajectory of the system:

As $\Delta t=0.01$ and we simulate $5120$ steps for each trajectory, time runs from $0$ to $51.20$. We need $5120$ steps because each flow map is trained on 5 points and the largest flow map advances $2^{10}\,\Delta t$ in time per step, i.e. $2^{10}\cdot 5=5120$ steps in total.

We will now plot an ensemble of trajectories with different initial conditions. First, we plot $160$ trajectories for $x$ and then for $y$

we can also plot in $(x,y)$ coordinates i.e. in the phase space of the dynamical system

From the plots above, we see that the dynamics of the system are quite simple and converge asymptotically towards zero in both states.

Remember that even though we have a long time series for each trajectory, we feed each flow map only 5 points from each trajectory, corresponding to its $\Delta t$, s.t. the models are presented with an equal amount of data. This can be seen visually in the analysis of training patterns below.

Explorative analysis of training patterns

We will initially focus on two flow maps $\hat{F}_{1}$ and $\hat{F}_{10}$, however, all plots and gifs are available for all flow maps in the folder 3_week_data_driven_dyn_sys/multiscale_HiTS/figures/hyperbolic/training/model_D1_noise0.0 of the repository.

Training patterns for $\hat{F}_{1}$

Train log loss for $\hat{F}_{1}$ i.e. a flow map of $\Delta t$ when trained for 5000 epochs

We notice a dramatic decrease in the log loss until a stagnation around epoch 800. As training stagnates, the Adam optimizer keeps exploring new parameter combinations in search of better local optima, which gives rise to the fluctuations over the remaining epochs.

Validation log loss for $\hat{F}_{1}$ evaluated at each of the 5000 epochs

We see similar patterns and errors of the same scale, which indicates that the good model performance is not due to overfitting the training data.

To get a visual idea of what an error of this magnitude means, we create a plot every 10th epoch. We display the first trajectory in the validation set together with the found flow map. As $k=5$, we see 5 points.

The dramatic decrease in the loss function can be seen visually as the trained map fits the data better. The later fluctuations in the loss do not seem to visually affect the 5-step prediction much.

Training pattern for $\hat{F}_{10}$

Train log loss for $\hat{F}_{10}$, i.e. the map that advances $1024 \Delta t$ per step, trained for 5000 epochs

The same pattern is seen again, however, notice that the errors are now slightly greater than before.

Validation log loss for $\hat{F}_{10}$

The validation loss also follows the training loss hence the model does not seem to overfit the data. We visually inspect how the training progresses

The pattern is very similar to that of $\hat{F}_{1}$.

Analysis of individual flow map

With the trained flow maps, we can now visually inspect their individual performance on the test data. We let each map start with the same initial condition and let them propagate through time. Between points, we use linear interpolation

We see that the errors start to accumulate for maps of small $\Delta t$, whereas the maps of large $\Delta t$ are more accurate as time progresses. However, since we use linear interpolation, the errors between points are huge for large $\Delta t$. This error accumulation between points does give a sense of how coarse the large maps are in time.

Inspect simulations of data with different flow maps

For $\Delta t=0.01$ we see that though the errors seem to accumulate a lot on the error plot, we cannot visually see that much of a difference.

In this plot, we see how the linear interpolation gives rise to large error accumulation between maps from the large $\Delta t$.

We can now turn to model selection and discard the flow maps at the ends of the hierarchy that do not improve the predictive power of the model.

Optimal bridge coupling and final model

Using the described model selection procedure, we found that $u^*=4$ and $l^*=11$ for the hyperbolic system. To compare individual maps with the performance of the coupled model, we again display the losses:

We see how the HiTS outperforms or is at least as good as the other single maps. A visual inspection on the test data confirms that with this multiscale model, the performance is good:

The cubic oscillator


We will now consider the cubic oscillator $$ \begin{split} \dot{x} &= -0.1x^3 + 2y^3 \\ \dot{y} &= -2x^3 - 0.1y^3 \end{split} $$

It is a damped oscillator which we can see from one trajectory of the system

This appears as a spiral in phase space

As with the hyperbolic system, there is an attractor, but this time the system oscillates, hence it poses a different challenge to the model.

We compute 3200 train trajectories and 320 validation and test trajectories. The initial conditions are sampled from a uniform distribution as before. To qualitatively grasp what different initial conditions mean, we plot an ensemble of $160$ trajectories for $x$ and $y$

From these plots, we can deduce that all trajectories exhibit damped behavior. Further, all trajectories seem to stay inside some bounded interval that shrinks as time progresses.

The ensemble in $(x,y)$ coordinates also shows these damped spirals very explicitly

We will now track training patterns for the flow maps:

Training patterns for $\hat{F}_{1}$ and the cubic oscillator

Train log loss for $\hat{F}_{1}$, i.e. a flow map of $\Delta t$, trained for 5000 epochs

Like the training pattern of the hyperbolic system, the training improves the map rapidly within the first 800 epochs, after which it stagnates. From there, the errors fluctuate as the Adam optimizer searches for better optima. We can now check if the validation loss follows the same pattern:

Validation log loss for the flow map of $\Delta t$

The errors are of the same magnitude, hence we are not overfitting the data when testing out of sample. To understand what this means visually, we display the maps

We see that though the loss function fluctuates, the fluctuations occur at a magnitude that does not visually affect the 5-step prediction much.

Training pattern for $\hat{F}_{10}$ and the cubic oscillator

Train log loss for $\hat{F}_{10}$, i.e. a map of $1024 \Delta t$ per step, trained for 5000 epochs

Notice that the errors are of a very different magnitude and there seems to be a negative trend asymptotically. Here we see the strength of the Adam optimizer as it tries new combinations and avoids local optima. The validation loss is not included here, but it looks exactly like the training loss, hence no overfitting.

Since the errors are of a different magnitude, it is a very different visual picture we see when we inspect training over time:

We can inspect the loss function and the gif simultaneously and see that many of the new parameter combinations do not always lead to an improved model at first. Further, it seems that this very large flow map is too coarse to capture all details of the system.

Analysis of individual flow map for the cubic oscillator

To assess all flow maps, we let each flow map start from the same initial condition, and we can now plot the errors over time.

We see that the errors start to accumulate for maps of small $\Delta t$ and the large flow maps are more accurate over time; however, for this system, there seem to be maps with just the right $\Delta t$. In this case, these are $\Delta t=0.32$ and $\Delta t = 0.64$.

This time it is more obvious how errors start to propagate for small $\Delta t$:

On the other hand, the flow map of size $\Delta t=0.32$ performs well over time and seems just right for the system

The very large flow map does not perform well:

The map with a large $\Delta t$ is trained at a different granularity, hence one should perhaps look at a point forecast instead of this linearly interpolated one; however, the message is clear: the model selection procedure should exclude some of the maps with very small and very large $\Delta t$.

Model with found bridge coupling

For the cubic oscillator, we found that $u^*=4$ and $l^*=5$ i.e. flow maps of size $\Delta t=0.08$ and $\Delta t= 0.32$. We can display the found coupled model with each of the individual flow maps:

We see that the multiscale model is not always better than the individual models, though most of the time it is. The model-building procedure could perhaps be improved by letting some flow maps be used in specific intervals of time or by selecting the flow maps in a different way. Currently, we only allow flow maps to be taken in a contiguous sequence, e.g. $\{\hat{F}_{3},\hat{F}_{4},\hat{F}_{5}\}$; it could be that $\{\hat{F}_{3},\hat{F}_{5}\}$ is a better combination, but we do not check it. The problem then becomes the combinatorial number of candidate sets.

The found multiscale model performs reasonably well when visually inspected against data; however, at the very end we see some deviation:

We conclude that this multiscale approach also works well on this cubic damped oscillator, and for this system, there seems to be just the right $\Delta t$ to encapsulate the system dynamics. Very large and very small flow maps did not perform well here.

Van der Pol oscillator


The system is determined by

$$ \begin{split} \dot{x} &= y \\ \dot{y} &= \mu(1-x^2)y - x \end{split} $$

where we set $\mu=2.0$. One realization of the system looks like:

The system exhibits repeating patterns that do not dampen. This poses a new challenge for the flow maps. In phase space, the dynamics are

where we more explicitly see the relationship between the variables in the oscillations.

We compute 3200 train trajectories and 320 validation and test trajectories, sampled with different initial conditions. The qualitative behavior can be assessed from 160 trajectories of the system:

After some initial irregular fluctuations, the system exhibits regular, repeating patterns and all trajectories reside inside an interval that remains fixed over time. There is no damping in this system.

we can also plot in $(x,y)$ coordinates i.e. in the phase space of the dynamical system

here we see how all the trajectories with different initial conditions converge towards this cyclic attractor.

Training patterns for $\hat{F}_{1}$ and Van der Pol oscillator

Train log loss for $\hat{F}_{1}$ i.e. a flow map of $\Delta t$ trained for 5000 epochs

We see very fast improvement in training, and the stagnation starts already after 500 epochs. From there, we again see large fluctuations, as for the other systems. The model still performs well out of sample as the validation loss follows the training loss closely:

Validation log loss for $\hat{F}_{1}$

The errors are really small, hence we expect the model to follow the data closely:

Training pattern for $\hat{F}_{10}$ and Van der Pol oscillator

Train log loss for $\hat{F}_{10}$ i.e. we go $1024 \Delta t$ steps in time

The errors are of a very different magnitude, but so is the timescale. The validation loss again follows the train loss closely, and the visual inspection of the training over time really shows how the model seems unable to find the right combination of parameters.

We will now assess the remaining flow maps:

Analysis of individual flow map for Van der Pol oscillator

For each of the trained flow maps, we can now plot the errors:

Very interestingly, we see that the flow map of $\Delta t =0.32$ seems appropriate for this system, similar to what we saw for the cubic oscillator. We also see that the large flow map $\Delta t =10.24$ performs quite well, hence in this case it would perhaps be beneficial to have a model selection procedure that tries out combinations of flow maps instead of just determining the best contiguous sequence.

We now inspect the performance of $\Delta t =0.32$.

The map seems to capture the dynamics very accurately; only at the tips, where the change is very rapid, does it have some discrepancies.

Model with best found bridge coupling

We found that $u^*=3$ and $l^*=6$ are the optimal bounds, hence the largest flow map is exactly $\Delta t = 0.32$. The model probably just needs the smaller flow maps to accurately capture the very fast dynamics at the spikes. The multiscale model is equal to or better than any individual flow map for the Van der Pol oscillator.

With the inclusion of the smaller maps, the model is now also accurate in the areas with rapid dynamics.

We conclude that the flow maps are also effective for modeling systems with non-damping oscillations.

Wind speed at Kegnæs fyr with data from DMI

In a separate notebook, we explore the DMI API and do all data preprocessing. Many unexpected problems arose, which we briefly cover in this notebook to showcase the limitations of the dataset and how we handle them in order to use the hierarchical time-stepping algorithm on this data. In this notebook, we only consider the wind speed; in the SINDy notebook, we will combine it with the wind direction.

Initial data exploration

Consider first the 10 min average wind speeds from 2018 through 31 December 2020, which we will use as our training data.

10 min averages are the finest granularity that the DMI API offers, and we need quite a lot of data to train the flow maps. To validate and test the found models, we use the wind speeds from 2021. The wind speed in this period looks like

For both datasets, we see some patterns; however, stochasticity is definitely also present in the system, an element we have not considered in the simulated case studies. For now, we will not try to denoise or filter the data but simply test the model performance directly.

Choose the flow maps according to data available

When we choose $\Delta t$ for the flow maps, we must consider the limitations of our dataset. We have 1096 days in the training data, and we want to make sure that we have enough data to train the largest flow map as well: it should be able to map at least twice within each batch. We choose step sizes ranging from $\Delta t$ (10 min) up to $144\,\Delta t$ (one day).

Therefore, we subset the data into 500 batches, each spanning two days of $\Delta t$ steps. We cannot use the data directly, as quite a few days contain missing values.

Missing values 2018-2021

The DMI API returns all the wind speeds it holds for the period; however, instances of missing values are not NaN-encoded, so one has to infer them. We solved this by generating all 10 min timestamps in 2018-2021 and checking where the dataset did not match the generated dates and times. The following plot shows the instances of missing values:

We first group the dataset into one-day intervals and sum the number of missing values per day. Each day has $6\cdot24=144$ datapoints, and we choose to discard dates with more than 6 missing values. If a day has 6 or fewer missing values, we forward-fill the missing data points.
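This cleaning step could be sketched with pandas as follows (the series name and the simulated gaps are assumptions standing in for the DMI data): reindex against the full 10-min grid so gaps become explicit NaNs, discard days with more than 6 missing values, and forward-fill the rest:

```python
import pandas as pd
import numpy as np

# Hypothetical frame mimicking the DMI data: 10-min wind speeds for ten
# days (the series name 'wind_speed' is an assumption).
full_index = pd.date_range("2018-01-01", "2018-01-10 23:50", freq="10min")
obs = pd.Series(5.0 + np.random.default_rng(1).normal(0, 1, len(full_index)),
                index=full_index, name="wind_speed")

# Simulate the API's silent gaps: 3 stamps missing on Jan 2 (short gap)
# and 10 stamps missing on Jan 5 (too many for that day).
dropped = list(full_index[150:153]) + list(full_index[600:610])
obs = obs.drop(dropped)

# Reindex against all generated 10-min stamps so gaps become explicit NaNs
speeds = obs.reindex(full_index)

# Count missing values per day (144 points/day); discard days with > 6
missing_per_day = speeds.isna().groupby(speeds.index.normalize()).sum()
bad_days = missing_per_day[missing_per_day > 6].index
clean = speeds[~speeds.index.normalize().isin(bad_days)]

# Forward-fill the remaining short gaps
clean = clean.ffill()
```

In this toy example, only 5 January is discarded, and the three-point gap on 2 January is forward-filled, leaving a gap-free series.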

Training of DMI data model

We will now investigate the training patterns when we train the model in the same manner as we did with the simulated case studies.

Training patterns for the different $\Delta t$s for DMI data model

Overall, we see the same pattern, where the flow maps struggle to improve

Train for $\Delta t$ i.e. 10 min

We see that the errors are of the same magnitude as the wind speed itself. Further, the errors fluctuate without any initial or asymptotic convergence as we saw with the simulated data. The exact same pattern is seen if we look at a specific training dataset:

The flow map fits the data very poorly and adjusts randomly in each epoch without signs of improvement.

Train for $6\cdot \Delta t$ i.e. one hour

We see errors equal to or larger than the magnitude of the wind speed, i.e. the model does not capture any of the seasonality, trend or patterns in the data.

For the largest $\Delta t$ it is even worse:

Train for $144\cdot \Delta t$ i.e. one day

The magnitude of the error is definitely larger than the magnitude of the wind speed.

Analysis of individual flow maps for DMI data

We see that the smallest flow maps quickly accumulate a lot of error; however, none of the flow maps seems to perform well, as wind speeds are mostly of order $10^1$, which is also the order of our error. Note that it is not entirely fair to assess the small $\Delta t$ maps over the same span as the large ones, as they are not trained for long time projections.

We can now inspect simulations on the test data with some of the flow maps.

For the small $\Delta t$ the flow maps quickly go towards a constant.

First for $\Delta t=10$ min

Then also for $\Delta t=30$ min. In both cases they converge towards the same value, which is very likely the mean for the entire period.

The larger flow maps seem to capture some of the trends in this case. First the 12-hour map:

then the map for one day:

Since the mean error for the large flow map is still high, there could be examples where the large models do not catch any of the trend at all.

Model with best found bridge coupling

Though it will probably not improve things a lot, for the sake of interest we couple them optimally. We find that $u^*=72$ and $l^*=144$, hence the two largest flow maps perform best on this data. The errors are now:

We can visually assess the performance of the multiscale model:

The maps seem to capture some of the trends in the data; however, the errors are still quite large.

Discussion for DMI

The model does not perform well on the wind speed from Kegnæs fyr provided by DMI. The maps with small $\Delta t$ seem to converge towards a constant over time, and errors accumulate quite fast. The larger flow maps sometimes capture some of the trends but still have errors of the same magnitude as the wind speed. To improve the model, we should consider a transformation of the data, e.g. a Box-Cox transform to obtain a more constant variance, or denoising, e.g. keeping the first modes of a Fourier transform or applying exponential smoothing before presenting the data to the model. Further, the sampling technique could be changed s.t. batches are created for each individual flow map: the largest flow maps only map 2-4 times within each two-day batch, while, say, a one-hour map could map 48 times. We only trained for 5000 epochs, so one could also increase this number to see if the maps eventually find a better set of parameters. Wind speeds are very hard to predict, so it could also be that we are too optimistic about what is achievable for our model; we should perhaps focus on a much smaller time horizon, e.g. 5 hours. We could also try to include wind direction to improve the model predictions.

Conclusion

The hierarchical time-stepping algorithm is a powerful tool for modeling multiscale systems. It performs well on the simple deterministic hyperbolic system, on the damped cubic oscillator, and on the Van der Pol oscillator. In general, in the simulated examples, the maps with a small $\Delta t$ quickly start to accumulate large errors over time, while the flow maps with large $\Delta t$ can identify long-term dynamics. However, for both oscillators, there seems to be a flow map with an appropriate $\Delta t$ that encapsulates the majority of the patterns in the system, and only a few extra maps are needed for the last corrections. When we test the model on wind speed data from DMI, we see quite low performance. Data transformations, new sampling techniques, new training patterns, and adequate time horizons could be measures to improve the performance of the model. We will try some of these techniques when we explore SINDy in the next notebook.